Initial data: Polycot-data-from-NY.csv

Reading the data

Reading in the data (had to remove manually the ‘#’ sign from the column and save as csv file):

dat = read.table("data/Polycot-data-from-NY.csv", sep=",",header=TRUE)
str(dat)
## 'data.frame':    803 obs. of  4 variables:
##  $ plant.ID  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ cotyledons: int  2 2 2 2 2 2 2 2 2 2 ...
##  $ generation: Factor w/ 2 levels "O","p": 2 2 2 2 2 2 2 2 2 2 ...
##  $ parents   : Factor w/ 11 levels "","2x2","2x3",..: 1 1 1 1 1 1 1 1 1 1 ...

The data has 4 columns:

head(dat)
##   plant.ID cotyledons generation parents
## 1        1          2          p        
## 2        2          2          p        
## 3        3          2          p        
## 4        4          2          p        
## 5        5          2          p        
## 6        6          2          p
dat[200:210,]
##     plant.ID cotyledons generation parents
## 200      200          2          O     2x2
## 201      201          2          O     2x2
## 202      202          2          O     2x2
## 203      203          2          O     2x2
## 204      204          2          O     2x2
## 205      205          2          O     2x2
## 206      206          2          O     2x2
## 207      207          2          O     2x2
## 208      208          2          O     2x2
## 209      209          2          O     2x2
## 210      210          2          O     2x2

We can summarize basic counts. For example, we have 622 offspring plants and 181 parent plants. On average, the plants have 2.691 cotelydons (mean=2.691). The median being 3 cotelydons means that 50% of the plants have 3 cotelydons or fewer

summary(dat)
##     plant.ID       cotyledons    generation    parents   
##  Min.   :  1.0   Min.   :2.000   O:622             :181  
##  1st Qu.:201.5   1st Qu.:2.000   p:181      2x2    :152  
##  Median :402.0   Median :3.000              2x3    :139  
##  Mean   :399.2   Mean   :2.691              3x3    :109  
##  3rd Qu.:602.5   3rd Qu.:3.000              4x4    : 96  
##  Max.   :783.0   Max.   :5.000              3x4    : 87  
##                                             (Other): 39
summary(dat$parents)
##     2x2 2x3 2x4 3x3 3x4 4x4 4x5 4x6 4x7 4x8 
## 181 152 139  35 109  87  96   1   1   1   1

Data visualization

Now, we can combine groups in our data:

  • Parents
  • Offspring with both parents having 2 cotyledons
  • Offspring with one parent having more than 2 cotyledons
  • Offspring with both parents having more than 2 cotelydons
dat2 = within(dat, group <- "(2+)x(2+)")
dat2$group[dat2$parents == ""] = "parent"
dat2$group[dat2$parents == "2x2"] = "2x2"
dat2$group[dat2$parents == "2x3"] = "2x(2+)"
dat2$group[dat2$parents == "2x4"] = "2x(2+)"
dat2$group = factor(dat2$group,levels=c("parent", "2x2", "2x(2+)", "(2+)x(2+)"))

Statistical tests

Standard t-test

Standard t-test comparing:

  • Population 1: Offspring with both parents having 2 cotyledons
  • Population 2: Offspring with both parents having more than 2 cotyledons

Assumptions:

  • Data distributed normal
  • Equal variance in the two groups (we can test this with the Levene’s test later)

Plotting the densities:

Not really normal!

summary(dat2$group)
##    parent       2x2    2x(2+) (2+)x(2+) 
##       181       152       174       296
x = dat2$cotyledons[dat2$group == "2x2"]
y = dat2$cotyledons[dat2$group == "(2+)x(2+)"]

Variances look similar:

var(x)
## [1] 0.3275967
var(y)
## [1] 0.4926134

But let’s do a Levene test:

leveneTest(cotyledons ~ group, dat3, center=mean)
## Levene's Test for Homogeneity of Variance (center = mean)
##        Df F value Pr(>F)
## group   1  0.2634 0.6081
##       446

The p-value is greater than 0.05, so we do not reject the null hypothesis of equal variances.

The Levene test also assumes Normality. So, we can run the alternative Fligner-Killeen test with the same conclusion (equal variances):

fligner.test(cotyledons ~ group, dat3)
## 
##  Fligner-Killeen test of homogeneity of variances
## 
## data:  cotyledons by group
## Fligner-Killeen:med chi-squared = 0.33916, df = 1, p-value =
## 0.5603

Back to the t-test assumptions:

  • Normality: not met
  • Equality of variances: met

If we ignore the violation of normality and we run a standard t-test, we get:

t.test(x,y)
## 
##  Welch Two Sample t-test
## 
## data:  x and y
## t = -6.8072, df = 363.34, p-value = 4.126e-11
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.5422314 -0.2991627
## sample estimates:
## mean of x mean of y 
##  2.440789  2.861486

The p-value is less than 0.05, so we reject the null hypothesis of equality of means. This implies that we have evidence that the number of cotyledons is significantly larger in the offspring from “(2+)x(2+)” than from “2x2”.

Our justification to do a standard t-test is that if the sample size is moderately large, the two-sample t-test is robust to non-normality due to the central limit theorem.

Wilcoxon-Mann-Whitney test

Alternative to the t-test when normality is violated. This test does not test equality of means, but equality of distributions, so the conclusion could be deceiving (that is, two different distributions with the same mean will have a rejected null hypothesis using WMW test).

wilcox.test(cotyledons ~ group, dat3)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  cotyledons by group
## W = 15216, p-value = 7.461e-10
## alternative hypothesis: true location shift is not equal to 0

The p-value less than 0.05 allows us to also reject the null hypothesis of equality of distributions which provides evidence that the number of cotyledons on the “(2+)x(2+)” group is significantly higher than in the “2x2” group.

One-way ANOVA

If we want to compare more than two groups, we can do so with a one-way ANOVA test. We will compare the number of cotyledons on the four groups:

  • Population 1: Parent plants
  • Population 2: Offspring 2x2
  • Population 3: Offspring 2x(2+)
  • Population 4: Offspring (2+)x(2+)

Assumptions:

  • The observations are obtained independently and randomly from the population defined by the factor levels
  • The data of each factor level are normally distributed.
  • These normal populations have a common variance
res.aov <- aov(cotyledons ~ group, data = dat2)
summary(res.aov)
##              Df Sum Sq Mean Sq F value   Pr(>F)    
## group         3   18.7   6.248   13.62 1.14e-08 ***
## Residuals   799  366.7   0.459                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We reject that all groups have the same mean number of cotyledons.

We can check the assumptions:

# Homogeneity of variance
plot(res.aov,1)

# Normality
plot(res.aov,2)

Kruskal-Wallis rank sum test

When normality is not met, we can use a Kruskal-Wallis rank sum test, with the same conclusion:

kruskal.test(cotyledons ~ group, data = dat2)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  cotyledons by group
## Kruskal-Wallis chi-squared = 39.264, df = 3, p-value = 1.526e-08

Tukey HSD pairwise tests

We can move on to do Tukey multiple pairwise-comparisons (HSD test):

TukeyHSD(res.aov)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = cotyledons ~ group, data = dat2)
## 
## $group
##                        diff          lwr        upr     p adj
## 2x2-parent       -0.2387685 -0.430637375 -0.0468997 0.0076856
## 2x(2+)-parent    -0.0473741 -0.232532612  0.1377844 0.9124834
## (2+)x(2+)-parent  0.1819285  0.017371061  0.3464859 0.0234548
## 2x(2+)-2x2        0.1913944 -0.002228051  0.3850169 0.0540364
## (2+)x(2+)-2x2     0.4206970  0.246670761  0.5947233 0.0000000
## (2+)x(2+)-2x(2+)  0.2293026  0.062703782  0.3959014 0.0023572

New data: Polycot-data-2020.csv

Reading the data

Reading in the data (had to remove manually the ‘#’ sign from the column and save as csv file):

dat = read.table("data/Polycot-data-2020.csv", sep=",",header=TRUE)
str(dat)
## 'data.frame':    5585 obs. of  2 variables:
##  $ Cotyledons: int  2 2 2 3 2 2 2 2 3 2 ...
##  $ Generation: Factor w/ 2 levels "O","P": 2 2 2 2 2 2 2 2 2 2 ...
head(dat)
##   Cotyledons Generation
## 1          2          P
## 2          2          P
## 3          2          P
## 4          3          P
## 5          2          P
## 6          2          P
summary(dat)
##    Cotyledons    Generation
##  Min.   :2.000   O:3029    
##  1st Qu.:2.000   P:2556    
##  Median :3.000             
##  Mean   :2.595             
##  3rd Qu.:3.000             
##  Max.   :4.000
dat = within(dat, Generation<-factor(Generation, levels=c("P","O")))

Data visualization

Statistical tests

Standard t-test

Standard t-test comparing:

  • Population 1: Parents
  • Population 2: Offspring

Assumptions:

  • Data distributed normal
  • Equal variance in the two groups (we can test this with the Levene’s test later)

Plotting the densities:

Not really normal!

x = dat$Cotyledons[dat$Generation == "P"]
y = dat$Cotyledons[dat$Generation == "O"]

Variances look similar:

var(x)
## [1] 0.2418511
var(y)
## [1] 0.2789809

But let’s do a Levene test:

leveneTest(Cotyledons ~ Generation, dat, center=mean)
## Levene's Test for Homogeneity of Variance (center = mean)
##         Df F value   Pr(>F)    
## group    1  45.032 2.13e-11 ***
##       5583                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value is less than 0.05, so we reject the null hypothesis of equal variances.

The Levene test also assumes Normality. So, we can run the alternative Fligner-Killeen test with the same conclusion (equal variances):

fligner.test(Cotyledons ~ Generation, dat)
## 
##  Fligner-Killeen test of homogeneity of variances
## 
## data:  Cotyledons by Generation
## Fligner-Killeen:med chi-squared = 15.035, df = 1, p-value =
## 0.0001055

Back to the t-test assumptions:

  • Normality: not met
  • Equality of variances: not met

If we ignore the violations and we run a standard t-test, we get:

t.test(x,y, var.equal=TRUE)
## 
##  Two Sample t-test
## 
## data:  x and y
## t = -29.717, df = 5583, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.4354871 -0.3815863
## sample estimates:
## mean of x mean of y 
##  2.373239  2.781776

The p-value is less than 0.05, so we reject the null hypothesis of equality of means. This implies that we have evidence that the number of cotyledons is significantly larger in the offspring than in the parents.

Our justification to do a standard t-test is that if the sample size is moderately large, the two-sample t-test is robust to non-normality due to the central limit theorem.

We can also run the Welch version of t test that does not assume equal variances (this is the default in R):

t.test(x,y)
## 
##  Welch Two Sample t-test
## 
## data:  x and y
## t = -29.897, df = 5529.3, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.4353249 -0.3817485
## sample estimates:
## mean of x mean of y 
##  2.373239  2.781776

Wilcoxon-Mann-Whitney test

Alternative to the t-test when normality is violated. This test does not test equality of means, but equality of distributions, so the conclusion could be deceiving (that is, two different distributions with the same mean will have a rejected null hypothesis using WMW test).

wilcox.test(Cotyledons ~ Generation, dat)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Cotyledons by Generation
## W = 2417650, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0

The p-value less than 0.05 allows us to also reject the null hypothesis of equality of distributions which provides evidence that the number of cotyledons on the offspring group is significantly higher than in the parent group.

New data: Polycot-data-2020.csv focusing on chi-square test

Reading the data

Reading in the data (had to remove manually the ‘#’ sign from the column and save as csv file):

dat = read.table("data/Polycot-data-2020.csv", sep=",",header=TRUE)
str(dat)
## 'data.frame':    5585 obs. of  2 variables:
##  $ Cotyledons: int  2 2 2 3 2 2 2 2 3 2 ...
##  $ Generation: Factor w/ 2 levels "O","P": 2 2 2 2 2 2 2 2 2 2 ...
head(dat)
##   Cotyledons Generation
## 1          2          P
## 2          2          P
## 3          2          P
## 4          3          P
## 5          2          P
## 6          2          P
summary(dat)
##    Cotyledons    Generation
##  Min.   :2.000   O:3029    
##  1st Qu.:2.000   P:2556    
##  Median :3.000             
##  Mean   :2.595             
##  3rd Qu.:3.000             
##  Max.   :4.000
dat = within(dat, Generation<-factor(Generation, levels=c("P","O")))

Creating categorical variable: dycots/polycots

dat2 = data.frame(generation=dat$Generation)
dat2$cots = "dycot"
dat2$cots[dat$Cotyledons > 2] = "polycot"
dat2 = within(dat2, cots<-factor(cots))
summary(dat2)
##  generation      cots     
##  P:2556     dycot  :2437  
##  O:3029     polycot:3148
write.table(dat2,file="data/polycot-data-2020-categories.csv", row.names = FALSE, sep=",")

dt = table(dat2$cots, dat2$generation)

Data visualization

library(graphics)
mosaicplot(dt, shade = TRUE, las=2, main="Cotelydons")

Data analysis

dt
##          
##              P    O
##   dycot   1612  825
##   polycot  944 2204
ct = chisq.test(dt)

obs = ct$observed
exp = ct$expected

df = rbind(obs[1,],exp[1,],obs[2,],exp[2,],colSums(obs))
rownames(df) = c("dycot observed", "dycot expected", "polycot observed", "polycot expected", "total")

r = rowSums(obs)
ss = sum(r)
totals = c(r[1],NA,r[2],NA,ss)
cbind(df,totals)
##                         P        O totals
## dycot observed   1612.000  825.000   2437
## dycot expected   1115.304 1321.696     NA
## polycot observed  944.000 2204.000   3148
## polycot expected 1440.696 1707.304     NA
## total            2556.000 3029.000   5585
ct$statistic
## X-squared 
##  722.1475
ct$p.value
## [1] 4.567529e-159
## expected values
2437*2556/5585
## [1] 1115.304
2437*3029/5585
## [1] 1321.696
3148*2556/5585
## [1] 1440.696
3148*3029/5585
## [1] 1707.304